Primary Question:
How do these variables influence the amount being tipped?
Follow along using Tips-Example.R
What do you know already?
For Windows or OS X:
A server recorded the tips they received over about 10 weeks, along with several other variables: total bill, sex, smoker, day, time, and party size.
Load the tips data using read.csv()
tips <- read.csv("https://bit.ly/2fQoMP1", stringsAsFactors = TRUE) # factors, as in the output below (the default before R 4.0)
The head() function shows the first few rows of the data:
head(tips)
## total_bill tip sex smoker day time size
## 1 16.99 1.01 Female No Sun Dinner 2
## 2 10.34 1.66 Male No Sun Dinner 3
## 3 21.01 3.50 Male No Sun Dinner 3
## 4 23.68 3.31 Male No Sun Dinner 2
## 5 24.59 3.61 Female No Sun Dinner 4
## 6 25.29 4.71 Male No Sun Dinner 4
How big is the dataset? What types of variables are in each column?
str(tips)
## 'data.frame': 244 obs. of 7 variables:
## $ total_bill: num 17 10.3 21 23.7 24.6 ...
## $ tip : num 1.01 1.66 3.5 3.31 3.61 4.71 2 3.12 1.96 3.23 ...
## $ sex : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 2 2 2 2 2 ...
## $ smoker : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ day : Factor w/ 4 levels "Fri","Sat","Sun",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ time : Factor w/ 2 levels "Dinner","Lunch": 1 1 1 1 1 1 1 1 1 1 ...
## $ size : int 2 3 3 2 4 4 2 4 2 2 ...
R can easily summarize each variable in the dataset:
summary(tips)
## total_bill tip sex smoker day
## Min. : 3.07 Min. : 1.000 Female: 87 No :151 Fri :19
## 1st Qu.:13.35 1st Qu.: 2.000 Male :157 Yes: 93 Sat :87
## Median :17.80 Median : 2.900 Sun :76
## Mean :19.79 Mean : 2.998 Thur:62
## 3rd Qu.:24.13 3rd Qu.: 3.562
## Max. :50.81 Max. :10.000
## time size
## Dinner:176 Min. :1.00
## Lunch : 68 1st Qu.:2.00
## Median :2.00
## Mean :2.57
## 3rd Qu.:3.00
## Max. :6.00
First, we need to install and load ggplot2, a library for plotting the data
install.packages("ggplot2")
library(ggplot2)
What is the relationship between total bill and tip value?
qplot(tip, total_bill, geom = "point", data = tips)
Color the points by meal. Is there a difference?
qplot(tip, total_bill, geom = "point", data = tips, colour = time)
Add a linear regression line to the plot
qplot(tip, total_bill, geom = "point", data = tips) +
  geom_smooth(method = "lm")
Tips are usually based on a percentage of the total bill.
Make a new variable for the tipping rate = tip / total bill
# New variable rate is a combination of
# other variables in the tips dataset
tips$rate <- tips$tip / tips$total_bill
summary(tips$rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03564 0.12910 0.15480 0.16080 0.19150 0.71030
What is the distribution of tipping rates?
qplot(rate, data = tips, binwidth = .01)
One person tipped over 70%, who are they?
tips[which.max(tips$rate),]
## total_bill tip sex smoker day time size rate
## 173 7.25 5.15 Male Yes Sun Dinner 2 0.7103448
Look at the average tipping rate for men and women separately
mean(tips$rate[tips$sex == "Male"])
## [1] 0.1576505
mean(tips$rate[tips$sex == "Female"])
## [1] 0.1664907
There is a difference but is it statistically significant?
t.test(rate ~ sex, data = tips)
##
## Welch Two Sample t-test
##
## data: rate by sex
## t = 1.1433, df = 206.76, p-value = 0.2542
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.006404119 0.024084498
## sample estimates:
## mean in group Female mean in group Male
## 0.1664907 0.1576505
Boxplots are useful for comparing the distribution of data. Do smokers tip at different rates than non-smokers?
qplot(smoker, rate, geom = "boxplot", data = tips)
Try playing with chunks of code from this session to further investigate the tips data:
summary(tips$total_bill)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.07 13.35 17.80 19.79 24.13 50.81
qplot(day, rate, geom = "boxplot", data = tips)
mean(tips$tip[tips$smoker == "Yes"])
## [1] 3.00871
The help() function is useful for getting help with a function:
help(head)
The ? operator also works:
?head
When searching for results online, it is helpful to use R + CRAN + <query> to get good results.
A copy of the R reference card is available at:
http://cran.r-project.org/doc/contrib/Short-refcard.pdf
This card contains short versions of the most common functions used in R.
R can perform simple mathematical operations.
# Addition and Subtraction
2 + 5 - 1
## [1] 6
# Multiplication
109*23452
## [1] 2556268
# Division
3/7
## [1] 0.4285714
Here are a few more complex operations:
# Integer division
7 %/% 2
## [1] 3
# Modulo operator (Remainder)
7 %% 2
## [1] 1
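Integer division and the remainder fit together: for a dividend a and a non-zero divisor b, (a %/% b) * b + a %% b gives back a. A small check of that identity, using illustrative values that are assumptions and not part of the tips data:

```r
# Illustrative values (assumptions, not from the workshop data)
a <- c(7, 22, -7)
b <- 2

quotient  <- a %/% b   # integer division rounds toward -Inf
remainder <- a %% b    # remainder takes the sign of the divisor

# Recombining quotient and remainder recovers the original values
rebuilt <- quotient * b + remainder
print(rebuilt)
```

The negative value is included because %/% rounds down rather than toward zero, which is where the identity is easiest to get wrong.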
# Powers
1.5 ^ 3
## [1] 3.375
# Exponentiation
exp(3)
## [1] 20.08554
# Logarithms
log(3)
## [1] 1.098612
log(3, base = 10)
## [1] 0.4771213
# Trig functions
sin(0)
## [1] 0
cos(0)
## [1] 1
tan(pi/4)
## [1] 1
Variables in R are created using the assignment operator, <-:
x <- 5
R_awesomeness <- Inf
MyAge <- 21 #Haha
These variables can then be used in computation:
log(x)
## [1] 1.609438
MyAge ^ 2
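## [1] 441
Some short names are already used by R and are best avoided as variable names: c, q, t, C, D, F, T, I
Reserved words cannot be used as variable names: for, in, while, if, else, repeat, break, next
Error messages: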
# Variable starts with a number
1age <- 3
## Error: <text>:2:2: unexpected symbol
## 1: # Variable starts with a number
## 2: 1age
## ^
Error messages:
# Case Sensitive
Age <- 3
age
## Error in eval(expr, envir, enclos): object 'age' not found
Error messages:
# Special Words can't be variable names
for <- 3
## Error: <text>:2:5: unexpected assignment
## 1: # Special Words can't be variable names
## 2: for <-
## ^
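Base R's make.names() shows how R would repair an invalid name, which gives a quick way to check a candidate name before using it. A minimal sketch; the candidate names are only examples:

```r
# Example candidate names (assumptions for illustration)
candidates <- c("age", "1age", "for", "my age")

# make.names() rewrites anything that is not a syntactically valid name
repaired <- make.names(candidates)

# A name is valid as-is when make.names() leaves it unchanged
check <- data.frame(candidate = candidates,
                    repaired = repaired,
                    valid = candidates == repaired)
print(check)
```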
# This is a VERY bad idea:
T <- FALSE
F <- TRUE
T == FALSE
## [1] TRUE
F == TRUE
## [1] TRUE
rm(T, F) # Fix it!
T == FALSE
## [1] FALSE
Note: In R, T and F are shorthand for TRUE and FALSE
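The reason T and F can be broken while TRUE and FALSE cannot is that TRUE and FALSE are reserved words, whereas T and F are ordinary variables that happen to start out with those values. A short sketch of the difference (the message string is made up for the example):

```r
# TRUE is a reserved word: assigning to it fails
result <- tryCatch(
  eval(parse(text = "TRUE <- 0")),
  error = function(e) "assignment refused"
)
print(result)

# T is an ordinary variable, so it can be masked by our own value
T <- 0
masked <- isTRUE(T)    # FALSE while our T hides the built-in one
rm(T)                  # remove our copy; the built-in T is visible again
restored <- isTRUE(T)  # TRUE
print(c(masked = masked, restored = restored))
```

Writing TRUE and FALSE in full avoids the problem entirely.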
A variable can contain more than one value.
A vector is a variable which contains a set of values of the same type.
The c() (combine) function is used to create vectors:
y <- c(1, 5, 3, 2)
z <- c(y, y)
R performs operations on the entire vector at once:
y / 2
## [1] 0.5 2.5 1.5 1.0
z + 3
## [1] 4 8 6 5 4 8 6 5
Vectors can be modified using indexing:
# Get the total bill out of the tips dataset
bill <- tips$total_bill
x <- bill[1:5]
x
## [1] 16.99 10.34 21.01 23.68 24.59
x[1] <- 20
x
## [1] 20.00 10.34 21.01 23.68 24.59
Elements of a vector must all be the same type:
head(bill)
## [1] 16.99 10.34 21.01 23.68 24.59 25.29
bill[5] <- ":-("
head(bill)
## [1] "16.99" "10.34" "21.01" "23.68" ":-(" "25.29"
By changing a value to a string, all the other values were changed to strings as well.
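This coercion follows a fixed order: logical values are promoted to numbers, and numbers are promoted to strings. A small sketch with made-up values, including the way back from strings to numbers:

```r
# Made-up values for illustration
mixed <- c(1.5, TRUE, "a")   # one string turns every element into a string
print(mixed)

nums <- c(1.5, TRUE)         # a logical mixed with numbers becomes numeric
print(nums)

# Converting back: entries that are not numbers become NA (with a warning,
# suppressed here so the example runs quietly)
back <- suppressWarnings(as.numeric(c("16.99", ":-(")))
print(back)
```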
Using the R Reference Card (and the Help pages, if needed), do the following:
Use the rep function to construct the following vector: 1 1 2 2 3 3 4 4 5 5
Use rep to construct this vector: 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
Find the number of rows and columns of the iris dataset in more than one way
data(iris)
# first way:
nrow(iris)
## [1] 150
ncol(iris)
## [1] 5
# second way:
dim(iris)
## [1] 150 5
# third way:
str(iris) # look at the top line
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Use the `rep` function to construct the following vector:
# 1 1 2 2 3 3 4 4 5 5
rep(c(1:5), each = 2)
## [1] 1 1 2 2 3 3 4 4 5 5
# Use `rep` to construct this vector:
# 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
rep(c(1:5), times = 3)
## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
A vector is a list of values that are all the same type.
Vectors can be created using the c() or rep() function.
To create a vector of consecutive values, use the : operator:
a <- 10:15
a
## [1] 10 11 12 13 14 15
Elements of a vector can be extracted using brackets:
a[1]
## [1] 10
a[5]
## [1] 14
Indexes can also be more complicated:
a[c(1, 3, 5)]
## [1] 10 12 14
a[1:5]
## [1] 10 11 12 13 14
Logical vectors can be used for indexing as well:
x <- c(2, 3, 5, 7)
x[c(TRUE, FALSE, FALSE, TRUE)]
## [1] 2 7
x > 3.5
## [1] FALSE FALSE TRUE TRUE
x[x > 3.5]
## [1] 5 7
# Get the rate variable out of the tips dataset
rate <- tips$rate
head(rate)
## [1] 0.05944673 0.16054159 0.16658734 0.13978041 0.14680765 0.18623962
sad_tip <- rate < 0.10
rate[sad_tip]
## [1] 0.05944673 0.07180385 0.07892660 0.05679667 0.09935739 0.05643341
## [7] 0.09553024 0.07861635 0.07296137 0.08146640 0.09984301 0.09452888
## [13] 0.07717751 0.07398274 0.06565988 0.09560229 0.09001406 0.07745933
## [19] 0.08364236 0.06653360 0.08527132 0.08329863 0.07936508 0.03563814
## [25] 0.07358352 0.08822232 0.09820426
A data frame is a collection of vectors, similar to a table in an Excel spreadsheet
Columns of a data frame are accessed with the $ operator
tips is a data frame:
head(tips)
## total_bill tip sex smoker day time size rate
## 1 16.99 1.01 Female No Sun Dinner 2 0.05944673
## 2 10.34 1.66 Male No Sun Dinner 3 0.16054159
## 3 21.01 3.50 Male No Sun Dinner 3 0.16658734
## 4 23.68 3.31 Male No Sun Dinner 2 0.13978041
## 5 24.59 3.61 Female No Sun Dinner 4 0.14680765
## 6 25.29 4.71 Male No Sun Dinner 4 0.18623962
tips$sex shows the sex column of tips
# Show the first 20 items in the sex column of tips
tips$sex[1:20]
## [1] Female Male Male Male Female Male Male Male Male Male
## [11] Male Female Male Male Female Male Female Male Female Male
## Levels: Female Male
Use the sum function on a logical vector to calculate how many TRUEs are in the vector:
sum(c(TRUE, TRUE, FALSE, TRUE, FALSE))
## [1] 3
# Number of tips above 20% of the bill
sum(tips$rate > .2)
## [1] 39
# Combined total bill for those tips
sum(tips$total_bill[tips$rate > .2])
## [1] 619.23
Use mode or class to find out information about variables
str is useful to find information about the structure of your data
str(tips)
## 'data.frame': 244 obs. of 8 variables:
## $ total_bill: num 17 10.3 21 23.7 24.6 ...
## $ tip : num 1.01 1.66 3.5 3.31 3.61 4.71 2 3.12 1.96 3.23 ...
## $ sex : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 2 2 2 2 2 ...
## $ smoker : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ day : Factor w/ 4 levels "Fri","Sat","Sun",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ time : Factor w/ 2 levels "Dinner","Lunch": 1 1 1 1 1 1 1 1 1 1 ...
## $ size : int 2 3 3 2 4 4 2 4 2 2 ...
## $ rate : num 0.0594 0.1605 0.1666 0.1398 0.1468 ...
class(tips)
## [1] "data.frame"
mode(tips)
## [1] "list"
Convert variables to a different type using the as.* family of functions:
size <- head(tips$size)
size
## [1] 2 3 3 2 4 4
as.character(size)
## [1] "2" "3" "3" "2" "4" "4"
as.numeric("2")
## [1] 2
There is a whole variety of useful functions to operate on vectors.
tip <- tips$tip
x <- tip[1:5]
length(x) # Number of elements of a vector
## [1] 5
sum(x) # Sum of elements in a vector
## [1] 13.09
Using the basic functions it wouldn’t be hard to compute some basic statistics.
(n <- length(tip))
## [1] 244
(meantip <- sum(tip) / n)
## [1] 2.998279
(standdev <- sqrt(sum((tip - meantip)^2) / (n - 1)))
## [1] 1.383638
But these functions are already built into R.
mean(tip)
## [1] 2.998279
sd(tip)
## [1] 1.383638
summary(tip)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 2.900 2.998 3.562 10.000
quantile(tip, c(.025, .975))
## 2.5% 97.5%
## 1.1760 6.4625
& (elementwise AND)
| (elementwise OR)
c(T, T, F, F) & c(T, F, T, F)
## [1] TRUE FALSE FALSE FALSE
c(T, T, F, F) | c(T, F, T, F)
## [1] TRUE TRUE TRUE FALSE
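Because & and | work element by element, they can combine two conditions into a single logical vector for row selection. A sketch on the built-in mtcars data; the thresholds are arbitrary choices for the example:

```r
data(mtcars)

# Two conditions, each a logical vector with one entry per row
efficient <- mtcars$mpg > 25   # arbitrary threshold
light     <- mtcars$wt < 2     # arbitrary threshold

both   <- mtcars[efficient & light, ]   # rows meeting both conditions
either <- mtcars[efficient | light, ]   # rows meeting at least one

# which() converts a logical vector into row positions
print(which(efficient & light))
print(c(both = nrow(both), either = nrow(either)))
```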
# Which are big bills with a poor tip rate?
bill <- tips$total_bill # re-extract: bill was overwritten with strings earlier
id <- (bill > 40 & rate < .10)
tips[id,]
## total_bill tip sex smoker day time size rate
## 103 44.30 2.5 Female Yes Sat Dinner 3 0.05643341
## 183 45.35 3.5 Male Yes Sun Dinner 3 0.07717751
## 185 40.55 3.0 Male Yes Sun Dinner 2 0.07398274
Load the diamonds data with data(diamonds)
Read up on the dataset (?diamonds)
Create a variable ppc for price/carat. Store this variable as a column in the diamonds data
qplot(carat, price, data = diamonds)
# Create a variable ppc for price/carat
diamonds$ppc <- diamonds$price / diamonds$carat
qplot(ppc, geom = "histogram",
data = diamonds[diamonds$ppc > 10000,])
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Data frames can be indexed with [ ]
df[i,j] will select the element in the \(i^{th}\) row and \(j^{th}\) column
df[ ,j] will select the entire \(j^{th}\) column and treat it as a vector
df[i ,] will select the entire \(i^{th}\) row (returned as a one-row data frame)
Columns can also be selected with the $ operator
Use Edgar Anderson’s Iris Data:
flower <- iris
Select Species column (5th column):
flower[,5]
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa setosa setosa setosa
## [19] setosa setosa setosa setosa setosa setosa
## [25] setosa setosa setosa setosa setosa setosa
## [31] setosa setosa setosa setosa setosa setosa
## [37] setosa setosa setosa setosa setosa setosa
## [43] setosa setosa setosa setosa setosa setosa
## [49] setosa setosa versicolor versicolor versicolor versicolor
## [55] versicolor versicolor versicolor versicolor versicolor versicolor
## [61] versicolor versicolor versicolor versicolor versicolor versicolor
## [67] versicolor versicolor versicolor versicolor versicolor versicolor
## [73] versicolor versicolor versicolor versicolor versicolor versicolor
## [79] versicolor versicolor versicolor versicolor versicolor versicolor
## [85] versicolor versicolor versicolor versicolor versicolor versicolor
## [91] versicolor versicolor versicolor versicolor versicolor versicolor
## [97] versicolor versicolor versicolor versicolor virginica virginica
## [103] virginica virginica virginica virginica virginica virginica
## [109] virginica virginica virginica virginica virginica virginica
## [115] virginica virginica virginica virginica virginica virginica
## [121] virginica virginica virginica virginica virginica virginica
## [127] virginica virginica virginica virginica virginica virginica
## [133] virginica virginica virginica virginica virginica virginica
## [139] virginica virginica virginica virginica virginica virginica
## [145] virginica virginica virginica virginica virginica virginica
## Levels: setosa versicolor virginica
Select Species column with the $ operator:
flower$Species
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa setosa setosa setosa
## [19] setosa setosa setosa setosa setosa setosa
## [25] setosa setosa setosa setosa setosa setosa
## [31] setosa setosa setosa setosa setosa setosa
## [37] setosa setosa setosa setosa setosa setosa
## [43] setosa setosa setosa setosa setosa setosa
## [49] setosa setosa versicolor versicolor versicolor versicolor
## [55] versicolor versicolor versicolor versicolor versicolor versicolor
## [61] versicolor versicolor versicolor versicolor versicolor versicolor
## [67] versicolor versicolor versicolor versicolor versicolor versicolor
## [73] versicolor versicolor versicolor versicolor versicolor versicolor
## [79] versicolor versicolor versicolor versicolor versicolor versicolor
## [85] versicolor versicolor versicolor versicolor versicolor versicolor
## [91] versicolor versicolor versicolor versicolor versicolor versicolor
## [97] versicolor versicolor versicolor versicolor virginica virginica
## [103] virginica virginica virginica virginica virginica virginica
## [109] virginica virginica virginica virginica virginica virginica
## [115] virginica virginica virginica virginica virginica virginica
## [121] virginica virginica virginica virginica virginica virginica
## [127] virginica virginica virginica virginica virginica virginica
## [133] virginica virginica virginica virginica virginica virginica
## [139] virginica virginica virginica virginica virginica virginica
## [145] virginica virginica virginica virginica virginica virginica
## Levels: setosa versicolor virginica
flower$Species == "setosa"
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [23] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [34] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [45] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
flower[flower$Species == "setosa", ]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
Create a data frame using the data.frame function
mydf <- data.frame(NUMS = 1:5,
lets = letters[1:5],
vehicle = c("car", "boat", "car", "car", "boat"))
mydf
## NUMS lets vehicle
## 1 1 a car
## 2 2 b boat
## 3 3 c car
## 4 4 d car
## 5 5 e boat
Use the names function to set that first column to lowercase:
names(mydf)[1] <- "nums"
mydf
## nums lets vehicle
## 1 1 a car
## 2 2 b boat
## 3 3 c car
## 4 4 d car
## 5 5 e boat
mtcars is a built-in data set like iris. Extract the 4th row of the mtcars data.
mydf <- data.frame(col1 = 1:6, col2 = rep(c("a", "b"), times = 3))
mydf
## col1 col2
## 1 1 a
## 2 2 b
## 3 3 a
## 4 4 b
## 5 5 a
## 6 6 b
mydf[mydf$col2 == "a",]
## col1 col2
## 1 1 a
## 3 3 a
## 5 5 a
mydf
## col1 col2
## 1 1 a
## 2 2 b
## 3 3 a
## 4 4 b
## 5 5 a
## 6 6 b
# 4th row of the mtcars data
data(mtcars)
mtcars[4,]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Lists are created with the list function
Use [[ ]] to select an object
Creating a list containing a matrix and a vector:
mylist <- list(matrix(letters[1:10], nrow = 2, ncol = 5),
seq(0, 49, by = 7))
mylist
## [[1]]
## [,1] [,2] [,3] [,4] [,5]
## [1,] "a" "c" "e" "g" "i"
## [2,] "b" "d" "f" "h" "j"
##
## [[2]]
## [1] 0 7 14 21 28 35 42 49
Use indexing to select the second list element:
mylist[[2]]
## [1] 0 7 14 21 28 35 42 49
mylist <- list(vec = 1:6,
               df = data.frame(x = 1:2,
                               y = 3:4,
                               z = 5:6))
mylist[[2]]
## x y z
## 1 1 3 5
## 2 2 4 6
mylist[[2]][1,]
## x y z
## 1 1 3 5
head(x) - View top 6 rows of a data frame
tail(x) - View bottom 6 rows of a data frame
summary(x) - Summary statistics
str(x) - View structure of object
dim(x) - View dimensions of object
length(x) - Returns the length of a vector
Examine the first two rows of an object by passing the n parameter to the head function:
head(diamonds, n = 2) # first 2 rows of diamonds data frame
## # A tibble: 2 × 11
## carat cut color clarity depth table price x y z ppc
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 1417.391
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 1552.381
tail(diamonds, n = 2) # last 2 rows of diamonds data frame
## # A tibble: 2 × 11
## carat cut color clarity depth table price x y z ppc
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0.86 Premium H SI2 61.0 58 2757 6.15 6.12 3.74 3205.814
## 2 0.75 Ideal D SI2 62.2 55 2757 5.83 5.87 3.64 3676.000
What’s the structure of the object?
str(diamonds) # structure of diamonds data frame
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 11 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
## $ ppc : num 1417 1552 1422 1152 1081 ...
str(mylist) # structure of mylist list
## List of 2
## $ vec: int [1:6] 1 2 3 4 5 6
## $ df :'data.frame': 2 obs. of 3 variables:
## ..$ x: int [1:2] 1 2
## ..$ y: int [1:2] 3 4
## ..$ z: int [1:2] 5 6
How does R summarize objects?
summary(diamonds) # summarize each column in diamonds
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z ppc
## Min. : 0.000 Min. : 0.000 Min. : 1051
## 1st Qu.: 4.720 1st Qu.: 2.910 1st Qu.: 2478
## Median : 5.710 Median : 3.530 Median : 3495
## Mean : 5.735 Mean : 3.539 Mean : 4008
## 3rd Qu.: 6.540 3rd Qu.: 4.040 3rd Qu.: 4950
## Max. :58.900 Max. :31.800 Max. :17829
##
summary(mylist) # summarize mylist - # values in each item in the list
## Length Class Mode
## vec 6 -none- numeric
## df 3 data.frame list
What are the dimensions of the object?
dim(diamonds) # dimensions of diamonds data frame
## [1] 53940 11
dim(mylist) # mylist doesn't have dimensions because it isn't a rectangular object
## NULL
length(diamonds) # diamonds is a data frame with 11 columns, the original 10 plus ppc (or really, a list of 11 vectors that are the same length)
## [1] 11
length(mylist) # mylist has 2 objects
## [1] 2
head(mtcars, n = 8)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
dim(iris)
## [1] 150 5
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Examine the output of a function with str(x)
iris.subset <- iris[iris$Species != "virginica", ]
t.test(Petal.Length ~ Species, data = iris.subset)
##
## Welch Two Sample t-test
##
## data: Petal.Length by Species
## t = -39.493, df = 62.14, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.939618 -2.656382
## sample estimates:
## mean in group setosa mean in group versicolor
## 1.462 4.260
Save the output of the t-test to an object
tout <- t.test(Petal.Length ~ Species, data = iris.subset)
Look at the structure of the t-test object:
str(tout)
## List of 9
## $ statistic : Named num -39.5
## ..- attr(*, "names")= chr "t"
## $ parameter : Named num 62.1
## ..- attr(*, "names")= chr "df"
## $ p.value : num 9.93e-46
## $ conf.int : atomic [1:2] -2.94 -2.66
## ..- attr(*, "conf.level")= num 0.95
## $ estimate : Named num [1:2] 1.46 4.26
## ..- attr(*, "names")= chr [1:2] "mean in group setosa" "mean in group versicolor"
## $ null.value : Named num 0
## ..- attr(*, "names")= chr "difference in means"
## $ alternative: chr "two.sided"
## $ method : chr "Welch Two Sample t-test"
## $ data.name : chr "Petal.Length by Species"
## - attr(*, "class")= chr "htest"
Since this is simply a list, use regular indexing to access the p-value.
tout$p.value
## [1] 9.934433e-46
tout[[3]]
## [1] 9.934433e-46
It is generally necessary to import data into R rather than just using built-in datasets.
Set the working directory with setwd() (or use an appropriate file path)
read.table() for reading in .txt files
read.csv() for reading in .csv files
The readr package has “smarter” versions of these functions and may be more useful
First, create a csv file. Use a text editor, Excel… Then load it in:
littledata <- read.csv("PretendData.csv")
Read the text file in using read.table. You may need to look at the help page for read.table in order to properly do this.
webcomics <- read.table("./data/FunWebcomics.txt")
webcomics
## V1 V2
## 1 Fun Webcomics URL
## 2 xkcd http://xkcd.com/
## 3 sarah's scribbles http://www.gocomics.com/sarahs-scribbles
## 4 the oatmeal http://theoatmeal.com/
## 5 dinosaur comics http://www.qwantz.com/
## 6 hyperbole and a half http://hyperboleandahalf.blogspot.com/
Install packages with install.packages()
Load packages with library()
The sos package adds helpful features for searching for packages related to a particular topic
ggplot2: Statistical graphics
dplyr/tidyr: Manipulating data structures
knitr: integrate LaTeX, HTML, or Markdown with R for easy reproducible research
Code Skeleton:
foo <- function(arg1, arg2, ...) {
  # Code goes here
  return(output)
}
Example:
mymean <- function(data) {
  ans <- sum(data) / length(data)
  return(ans)
}
Skeleton:
if (condition) {
  # Some code that runs if condition is TRUE
} else {
  # Some code that runs if condition is FALSE
}
Example:
mymean <- function(data) {
  if (!is.numeric(data)) {
    stop("Numeric input is required")
  } else {
    ans <- sum(data) / length(data)
    return(ans)
  }
}
for (i in 1:3) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
tips <- read.csv("https://bit.ly/2iNqvKM")
id <- c("total_bill", "tip", "size")
for (colname in id) {
  print(colname)
}
## [1] "total_bill"
## [1] "tip"
## [1] "size"
for (colname in id) {
  print(paste(colname, mymean(tips[, colname])))
}
## [1] "total_bill 19.7859426229508"
## [1] "tip 2.99827868852459"
## [1] "size 2.56967213114754"
i <- 1
while (i <= 5) {
  print(i)
  i <- i + 1
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
Write a function that returns the mean and standard deviation of its input (sd may be useful)
myfun <- function(x) {
  m <- mean(x)
  s <- sd(x)
  return(c(mean = m, sd = s))
}
myfun <- function(x) {
  if (is.logical(x)) {
    x <- as.numeric(x)
  }
  if (!is.numeric(x)) {
    warning("x is not logical or numeric. Cannot compute a mean or std. deviation.")
    return(c(mean = NA, sd = NA))
  }
  m <- mean(x)
  s <- sd(x)
  return(c(mean = m, sd = s))
}
data(diamonds)
diamondStats <- matrix(0, nrow = ncol(diamonds), ncol = 2,
                       dimnames = list(names(diamonds),
                                       c("mean", "sd")))
for (i in 1:ncol(diamonds)) {
  diamondStats[i,] <- myfun(diamonds[[i]])
}
diamondStats
## mean sd
## carat 0.7979397 0.4740112
## cut NA NA
## color NA NA
## clarity NA NA
## depth 61.7494049 1.4326213
## table 57.4571839 2.2344906
## price 3932.7997219 3989.4397381
## x 5.7311572 1.1217607
## y 5.7345260 1.1421347
## z 3.5387338 0.7056988
R Markdown is an authoring format that enables easy creation of dynamic documents, presentations, and reports from R. It combines the core syntax of markdown (an easy-to-write plain text format) with embedded R code chunks that are run so their output can be included in the final document. R Markdown documents are fully reproducible (they can be automatically regenerated whenever underlying R code or data changes).
Study the first page of the R Markdown Reference Guide.
Yes, the entire markdown syntax can be described in one page!
Can you think of anything that is missing from the syntax (that you might want when creating documents)?
You can add LaTeX markup, but don’t expect it to convert between output formats.
Have a look at R Markdown presentations and templates.
Pro tip: run devtools::install_github("rstudio/rticles") to get more templates
The stuff at the top of the .Rmd file (called yaml front matter) tells rmarkdown what output format to use.
---
title: "Untitled"
date: "May 16, 2016"
output: html_document
---
In this case, when “Knit HTML” is clicked, RStudio calls rmarkdown::render("file.Rmd", html_document()). Default values can be changed (see the source of this presentation).
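The same render call can be made from a script instead of the Knit button. The sketch below writes a tiny .Rmd file to a temporary directory (the file name and contents are made up for the example) and renders it only if the rmarkdown package and pandoc are both available, which is an assumption about the local setup:

```r
# Build a minimal .Rmd file in a temporary directory (hypothetical example file)
fence <- strrep("`", 3)   # three back-ticks, built here to keep this listing readable
rmd_lines <- c(
  "---",
  'title: "Untitled"',
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r chunk1}"),
  "1 + 1",
  fence
)
rmd_file <- file.path(tempdir(), "file.Rmd")
writeLines(rmd_lines, rmd_file)

# Render only when rmarkdown and pandoc are both available
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()
if (can_render) {
  out <- rmarkdown::render(rmd_file, output_format = "html_document", quiet = TRUE)
  print(out)   # path of the generated HTML file
}
```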
A code chunk is a concept borrowed from the knitr package (which, in turn, was inspired by literate programming). In .Rmd files, you can start/end a code chunk with three back-ticks.
```{r chunk1}
1 + 1
```
Want to run a command in another language?
```{r chunk2, engine = 'python'}
print("a" + "b")
```
There is a plethora of chunk options in knitr (engine is one of them). Here are some that I typically use:
echo: Show the code?
eval: Run the code?
message: Relay messages?
warning: Relay warnings?
fig.width and fig.height: Change size of figure output.
cache: Save the output of this chunk (so we don’t have to run it next time)?
Study the second page of the R Markdown Reference Guide and go back to the Hello R Markdown example we created.
Easy: Modify the figure sizing and alignment.
Medium: Add a figure caption.
Hard: Can you create an animation? (Hint: look at the fig.show chunk option – you might need the animation package for this)
Pro Tip: Don’t like the default chunk option value? Change it at the top of the document:
```{r setup2}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```
```{r, fig.align = "right", fig.width = 3, fig.height = 3, out.width = "50%"}
qplot(rnorm(100))
```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```{r, fig.cap = "Histogram of 100 samples from a normal distribution"}
qplot(rnorm(100))
```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Histogram of 100 samples from a normal distribution
```{r, fig.show = 'animate', ffmpeg.format = 'mp4'}
samples <- seq(100, 500, 50)
for (i in samples) {
print(
qplot(rnorm(i)) + ggtitle(sprintf("%d Samples from a Normal Dist", i))
)
}
```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Ugly:
m <- lm(mpg ~ disp, data = mtcars)
summary(m) # output isn't very attractive
##
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8922 -2.2022 -0.9631 1.6272 7.2305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.599855 1.229720 24.070 < 2e-16 ***
## disp -0.041215 0.004712 -8.747 9.38e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared: 0.7183, Adjusted R-squared: 0.709
## F-statistic: 76.51 on 1 and 30 DF, p-value: 9.38e-10
Pretty:
pander is one great option.
library(pander)
pander(m)
|   | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| disp | -0.04122 | 0.004712 | -8.747 | 9.38e-10 |
| (Intercept) | 29.6 | 1.23 | 24.07 | 3.577e-21 |
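The numbers in this table come from the coefficient matrix that summary() computes. A base-R sketch of pulling that matrix out directly (no extra packages assumed), which is useful when a single estimate or p-value is needed in the text of a report:

```r
m <- lm(mpg ~ disp, data = mtcars)

# coef() on a model summary returns a numeric matrix, one row per term
coef_table <- coef(summary(m))

# Individual cells are reached by row and column name
slope   <- coef_table["disp", "Estimate"]
p_value <- coef_table["disp", "Pr(>|t|)"]

print(round(coef_table, 4))
```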
a <- anova(m)
a
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## disp 1 808.89 808.89 76.513 9.38e-10 ***
## Residuals 30 317.16 10.57
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pander(a)
|   | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| disp | 1 | 808.9 | 808.9 | 76.51 | 9.38e-10 |
| Residuals | 30 | 317.2 | 10.57 | NA | NA |
methods(pander)
## [1] pander.anova* pander.aov*
## [3] pander.aovlist* pander.Arima*
## [5] pander.call* pander.cast_df*
## [7] pander.character* pander.clogit*
## [9] pander.coxph* pander.cph*
## [11] pander.CrossTable* pander.data.frame*
## [13] pander.Date* pander.default*
## [15] pander.density* pander.describe*
## [17] pander.evals* pander.factor*
## [19] pander.formula* pander.ftable*
## [21] pander.function* pander.glm*
## [23] pander.Glm* pander.gtable*
## [25] pander.htest* pander.image*
## [27] pander.irts* pander.list*
## [29] pander.lm* pander.lme*
## [31] pander.logical* pander.lrm*
## [33] pander.manova* pander.matrix*
## [35] pander.microbenchmark* pander.mtable*
## [37] pander.name* pander.nls*
## [39] pander.NULL* pander.numeric*
## [41] pander.ols* pander.orm*
## [43] pander.polr* pander.POSIXct*
## [45] pander.POSIXlt* pander.prcomp*
## [47] pander.randomForest* pander.rapport*
## [49] pander.rlm* pander.sessionInfo*
## [51] pander.smooth.spline* pander.stat.table*
## [53] pander.summary.aov* pander.summary.aovlist*
## [55] pander.summary.glm* pander.summary.lm*
## [57] pander.summary.lme* pander.summary.manova*
## [59] pander.summary.nls* pander.summary.polr*
## [61] pander.summary.prcomp* pander.summary.rms*
## [63] pander.summary.survreg* pander.summary.table*
## [65] pander.survdiff* pander.survfit*
## [67] pander.survreg* pander.table*
## [69] pander.tabular* pander.ts*
## [71] pander.zoo*
## see '?methods' for accessing help and source code
The calls above used pander.lm and pander.anova.
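A single call can reach different methods because of S3 dispatch: a generic function looks at the class of its argument and calls the method named generic.class. A minimal sketch with a hypothetical generic called describe, which is not part of pander and exists only for this illustration:

```r
# Hypothetical generic, for illustration only (not part of pander)
describe <- function(x, ...) UseMethod("describe")

# One method per class; the name after the dot is the class it handles
describe.lm <- function(x, ...) {
  paste("linear model with", length(coef(x)), "coefficients")
}
describe.anova <- function(x, ...) {
  paste("anova table with", nrow(x), "rows")
}
describe.default <- function(x, ...) {
  paste("object of class", class(x)[1])   # fallback for every other class
}

m <- lm(mpg ~ disp, data = mtcars)
print(describe(m))          # dispatches to describe.lm
print(describe(anova(m)))   # dispatches to describe.anova
print(describe(1:3))        # no integer method, so describe.default runs
```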